Dataset-3

Movie Profits



movieprofit <- read_delim("../../data/movie_profit.csv")
Rows: 3310 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (4): movie, distributor, mpaa_rating, genre
dbl  (4): production_budget, domestic_gross, worldwide_gross, decade
num  (1): profit_ratio
date (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
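
The column specification above lists profit_ratio as `num` rather than `dbl`, and the values printed later in this section run from about 1e+10 to 1e+16, far too large for a ratio. One possible explanation is that the file writes decimals with a comma, which readr's default locale treats as a grouping mark. This is only an assumption, since the raw file is not shown. The sketch below uses a hypothetical raw field to show how the two readings differ.

```r
library(readr)

# Hypothetical raw field: assumes the file writes decimals with a comma.
raw_value <- "0,76744800000000"

# Default locale: "," is a grouping mark, so it is dropped and the digits run together.
as_number <- parse_number(raw_value)

# Decimal-comma locale: "," is read as the decimal mark.
as_double <- parse_double(raw_value, locale = locale(decimal_mark = ","))

print(c(as_number, as_double))
```

If that assumption holds for movie_profit.csv, passing `locale = locale(decimal_mark = ",")` to read_delim() should bring profit_ratio back to ordinary ratio values.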
print(head(movieprofit))
# A tibble: 6 × 10
  release_date movie            production_budget domestic_gross worldwide_gross
  <date>       <chr>                        <dbl>          <dbl>           <dbl>
1 2005-07-22   November                    250000         191862          191862
2 1998-08-28   I Married a Str…            250000         203134          203134
3 1997-03-28   Love and Other …            250000         212285          743216
4 2000-07-14   Chuck&Buck                  250000        1055671         1157672
5 2011-10-28   Like Crazy                  250000        3395391         3728400
6 2003-04-11   Better Luck Tom…            250000        3802390         3809226
# ℹ 5 more variables: distributor <chr>, mpaa_rating <chr>, genre <chr>,
#   profit_ratio <dbl>, decade <dbl>
glimpse(movieprofit)
Rows: 3,310
Columns: 10
$ release_date      <date> 2005-07-22, 1998-08-28, 1997-03-28, 2000-07-14, 201…
$ movie             <chr> "November", "I Married a Strange Person", "Love and …
$ production_budget <dbl> 250000, 250000, 250000, 250000, 250000, 250000, 2500…
$ domestic_gross    <dbl> 191862, 203134, 212285, 1055671, 3395391, 3802390, 3…
$ worldwide_gross   <dbl> 191862, 203134, 743216, 1157672, 3728400, 3809226, 3…
$ distributor       <chr> "Other", "Other", "Other", "Other", "Paramount Pictu…
$ mpaa_rating       <chr> "R", NA, "R", "R", "PG-13", "R", "R", "R", "R", "R",…
$ genre             <chr> "Drama", "Comedy", "Comedy", "Drama", "Drama", "Dram…
$ profit_ratio      <dbl> 7.674480e+13, 8.125360e+13, 2.972864e+14, 4.630688e+…
$ decade            <dbl> 2000, 1990, 1990, 2000, 2010, 2000, 2010, 2000, 2010…
inspect(movieprofit)

categorical variables:  
         name     class levels    n missing
1       movie character   3310 3310       0
2 distributor character      6 3268      42
3 mpaa_rating character      4 3180     130
4       genre character      5 3310       0
                                   distribution
1 10 Days in a Madhouse (0%) ...               
2  Other (53.2%), Warner Bros. (11%) ...       
3 R (46.4%), PG-13 (33.5%), PG (17.4%) ...     
4 Drama (36.5%), Comedy (24.1%) ...            

Date variables:  
          name class      first       last min_diff  max_diff    n missing
1 release_date  Date 1936-02-05 2017-12-22   0 days 2592 days 3310       0

quantitative variables:  
               name   class      min           Q1       median           Q3
1 production_budget numeric 2.50e+05 9.500000e+06 2.000000e+07 4.500000e+07
2    domestic_gross numeric 0.00e+00 6.530094e+06 2.558731e+07 6.046695e+07
3   worldwide_gross numeric 4.23e+02 1.086144e+07 4.040902e+07 1.184703e+08
4      profit_ratio numeric 1.38e+10 7.861269e+13 1.962499e+14 3.942158e+14
5            decade numeric 1.93e+03 1.990000e+03 2.000000e+03 2.010000e+03
           max         mean           sd    n missing
1 1.750000e+08 3.326794e+07 3.460741e+07 3310       0
2 4.745447e+08 4.551509e+07 5.852794e+07 3310       0
3 1.162782e+09 9.384123e+07 1.389514e+08 3310       0
4 4.315179e+16 4.319388e+14 1.501736e+15 3310       0
5 2.010000e+03 1.998785e+03 1.061308e+01 3310       0
skim(movieprofit)
Data summary
Name                    movieprofit
Number of rows          3310
Number of columns       10
_______________________
Column type frequency:
  character             4
  Date                  1
  numeric               5
________________________
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
movie          0          1.00           1    35   0      3310      0
distributor    42         0.99           5    18   0      6         0
mpaa_rating    130        0.96           1    5    0      4         0
genre          0          1.00           5    9    0      5         0

Variable type: Date

skim_variable  n_missing  complete_rate  min         max         median      n_unique
release_date   0          1              1936-02-05  2017-12-22  2005-06-30  1723

Variable type: numeric

skim_variable      n_missing  complete_rate  mean          sd            p0        p25           p50           p75           p100          hist
production_budget  0          1              3.326794e+07  3.460741e+07  2.50e+05  9.500000e+06  2.000000e+07  4.500000e+07  1.750000e+08  ▇▂▁▁▁
domestic_gross     0          1              4.551509e+07  5.852794e+07  0.00e+00  6.530094e+06  2.558731e+07  6.046695e+07  4.745447e+08  ▇▁▁▁▁
worldwide_gross    0          1              9.384123e+07  1.389514e+08  4.23e+02  1.086144e+07  4.040902e+07  1.184703e+08  1.162782e+09  ▇▁▁▁▁
profit_ratio       0          1              4.319388e+14  1.501736e+15  1.38e+10  7.861269e+13  1.962499e+14  3.942158e+14  4.315179e+16  ▇▁▁▁▁
decade             0          1              1.998790e+03  1.061000e+01  1.93e+03  1.990000e+03  2.000000e+03  2.010000e+03  2.010000e+03  ▁▁▁▃▇


movieprofit_modified <- movieprofit %>%
  mutate(
    mpaa_rating = as.factor(mpaa_rating),
    genre = as.factor(genre),
    decade = as.factor(decade)
  )
glimpse(movieprofit_modified)
Rows: 3,310
Columns: 10
$ release_date      <date> 2005-07-22, 1998-08-28, 1997-03-28, 2000-07-14, 201…
$ movie             <chr> "November", "I Married a Strange Person", "Love and …
$ production_budget <dbl> 250000, 250000, 250000, 250000, 250000, 250000, 2500…
$ domestic_gross    <dbl> 191862, 203134, 212285, 1055671, 3395391, 3802390, 3…
$ worldwide_gross   <dbl> 191862, 203134, 743216, 1157672, 3728400, 3809226, 3…
$ distributor       <chr> "Other", "Other", "Other", "Other", "Paramount Pictu…
$ mpaa_rating       <fct> R, NA, R, R, PG-13, R, R, R, R, R, R, R, PG-13, NA, …
$ genre             <fct> Drama, Comedy, Comedy, Drama, Drama, Drama, Action, …
$ profit_ratio      <dbl> 7.674480e+13, 8.125360e+13, 2.972864e+14, 4.630688e+…
$ decade            <fct> 2000, 1990, 1990, 2000, 2010, 2000, 2010, 2000, 2010…
data_dictionary <- data.frame(
  Variable = c("release_date", "movie", "production_budget", "domestic_gross",
               "worldwide_gross", "distributor", "mpaa_rating", "genre", 
               "profit_ratio", "decade"),
  Data_Type = c("Date", "Character", "Numeric", "Numeric",
                "Numeric", "Character", "Factor", "Factor", 
                "Numeric", "Factor"),
  Description = c("The release date of the movie", "Title of the movie", 
                  "Budget allocated for the movie", "Gross revenue earned domestically", 
                  "Total gross revenue worldwide", "Company distributing the movie", 
                  "Rating assigned by the MPAA", "Genre of the movie", 
                  "Ratio of profit to production budget", "Decade in which the movie was released")
)


kable(data_dictionary)
|Variable          |Data_Type |Description                            |
|:-----------------|:---------|:--------------------------------------|
|release_date      |Date      |The release date of the movie          |
|movie             |Character |Title of the movie                     |
|production_budget |Numeric   |Budget allocated for the movie         |
|domestic_gross    |Numeric   |Gross revenue earned domestically      |
|worldwide_gross   |Numeric   |Total gross revenue worldwide          |
|distributor       |Character |Company distributing the movie         |
|mpaa_rating       |Factor    |Rating assigned by the MPAA            |
|genre             |Factor    |Genre of the movie                     |
|profit_ratio      |Numeric   |Ratio of profit to production budget   |
|decade            |Factor    |Decade in which the movie was released |

  • Qualitative Variables: movie, distributor, mpaa_rating, genre, decade

  • Quantitative Variables: production_budget, domestic_gross, worldwide_gross, profit_ratio

  • Temporal Variable: release_date



gf_bar(genre ~ profit_ratio | distributor, data = movieprofit_modified) 
Warning: Ignoring unknown aesthetics: .
Warning: The following aesthetics were dropped during statistical transformation: ..
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

median_profit_data <- movieprofit_modified %>%
  group_by(genre, distributor) %>%
  summarize(median_profit_ratio = median(profit_ratio), .groups = "drop")  


ggplot(median_profit_data, aes(x = median_profit_ratio, y = genre)) +
  geom_col() +  
  facet_wrap(~ distributor) + 
  labs(
    title = "Median Profit Ratio by Genre and Distributor",
    x = "Median Profit Ratio",
    y = "Genre"
  ) 
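
The earlier gf_bar() call gave warnings rather than this chart. gf_bar() counts rows, so it is normally given a one-sided formula, whereas precomputed values such as a median call for gf_col(). The sketch below draws the same kind of faceted horizontal bar chart with ggformula. The data frame `toy_medians` and its numbers are made up for illustration and are not values from movie_profit.csv.

```r
library(ggformula)

# Made-up medians for illustration only; not taken from the dataset.
toy_medians <- data.frame(
  genre = c("Action", "Horror", "Action", "Horror"),
  distributor = c("Other", "Other", "Warner Bros.", "Warner Bros."),
  median_profit_ratio = c(2.0, 3.5, 2.4, 4.1)
)

# Formula shape is y ~ x | facet: genre on the y-axis, bar length taken
# from the median, and one panel per distributor.
p <- gf_col(genre ~ median_profit_ratio | distributor, data = toy_medians) |>
  gf_labs(x = "Median Profit Ratio", y = "Genre")

print(p)
```

With the real data, the same formula could be applied to median_profit_data in place of the toy data frame.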

  • Plot type: horizontal bar plots.

  • Variables plotted: genre and the median of profit_ratio.

  • Action and Horror.

  • facet_wrap(~ distributor) gives a separate plot for each distributor.

  • We would have had to use mutate() to calculate the profit ratio.
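
The mutate step mentioned in the last bullet can be sketched as follows. It assumes the profit ratio is worldwide_gross divided by production_budget. The data dictionary only says "ratio of profit to production budget", so this definition is an inference: for November and Chuck&Buck it reproduces the leading digits shown for profit_ratio in the glimpse() output (7.674480 and 4.630688). The three rows are copied from the head() output, and the column name `profit_ratio_check` is a hypothetical one chosen for this example.

```r
library(dplyr)

# Three rows copied from the head() output shown earlier.
toy <- tibble(
  movie = c("November", "Chuck&Buck", "Like Crazy"),
  production_budget = c(250000, 250000, 250000),
  worldwide_gross = c(191862, 1157672, 3728400)
)

# Assumed definition: worldwide gross per unit of production budget.
toy <- toy %>%
  mutate(profit_ratio_check = worldwide_gross / production_budget)

print(toy)
```

Comparing a recomputed column like this against the stored profit_ratio is a quick way to check that the file was read with the intended scale.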